Abstract — Compensating for changes between a subject's training
and testing sessions in Brain-Computer Interfacing (BCI) is
challenging but of great importance for robust BCI operation.
We show that such changes are very similar between subjects, 
thus can be reliably estimated using data from other users 
and utilized to construct an invariant feature space. This novel 
approach to learning from other subjects aims to reduce the 
adverse effects of common non-stationarities, but does not trans- 
fer discriminative information. This is an important conceptual 
difference to standard multi-subject methods that e.g. improve 
the covariance matrix estimation by shrinking it towards the 
average of other users or construct a global feature space. 
These methods do not reduce the shift between training and
test data and may produce poor results when subjects have 
very different signal characteristics. In this paper we compare 
our approach to two state-of-the-art multi-subject methods on 
toy data and two data sets of EEG recordings from subjects 
performing motor imagery. We show that it can not only achieve 
a significant increase in performance, but also that the extracted 
change patterns allow for a neurophysiologically meaningful 
interpretation. 

Index Terms — Brain-Computer Interface, Common Spatial 
Patterns, Non-Stationarity, Transfer Learning. 



I. Introduction 

Incorporating data from other subjects (or sessions) into
the learning process has gained much attention in the
Brain-Computer Interfacing (BCI) community [1], [2], [3] as
it reduces calibration times and allows the construction of
subject-independent spatial filters and/or classifiers. One
popular approach [4], [5] is to regularize the covariance matrix towards
the average covariance matrix of other subjects in order to 
improve its estimation quality. This kind of regularization is 
especially promising in small sample size settings. Another 
very recent approach to transfer learning in BCI [2] formulates
the Common Spatial Patterns (CSP) computation as a multi- 
subject optimization problem, thus incorporates information 
from other subjects in order to construct a common feature 
space. It must be noted that both methods rely on very strong 
assumptions, namely a common underlying data generating 
process and similarity between the discriminative subspaces, 
respectively. However, due to the non-stationary nature of EEG 
and large variations between subjects these assumptions are 
hardly satisfied. This makes learning a common representation
or classification model very challenging. For example, when two
subjects have different signal characteristics, these methods may
even deteriorate performance as the spatial filters or classifier 
will be regularized in the "wrong" direction. A careful subject 
selection or weighting is therefore essential for a successful 
application. 

In this paper we propose a diametrically opposite approach, 
namely instead of learning the task-relevant part from oth- 
ers, we transfer information about non-stationarities in the 
data. Our method is especially promising when significant 
changes are present in the data, e.g. induced by differences
in experimental conditions between sessions. Its underlying 
assumption is that these principal non-stationarities are similar 
between subjects, thus can be transferred, and have an adverse 
effect on classification performance, thus removing them is 
favourable. Unlike the methods presented before, our approach
reduces the shift between training and test data and does not 
assume similarity between discriminative subspaces. Note that 
we define the discriminative subspace as the subspace spanned 
by the CSP filters. One important advantage of our method is 
the fact that the negative impact on performance is limited 
when subjects have very different signal characteristics. This 
is because the spatial filters are not regularized "towards" a low 
dimensional subspace, but "away" from one. In other words,
under the assumption that the true discriminative subspace is
small¹ compared to the data space, it is very unlikely that
we remove a significant amount of discriminative information
with our method. On the other hand, when regularizing towards
a small discriminative subspace we effectively disregard a much
larger amount of information (the orthogonal complement of this
subspace), thus if subjects have very different signal charac- 
teristics we may lose relevant information. Consequently, the 
importance of subject clustering or subject selection is largely 
reduced in our method. 

One scenario where transfer of information about non- 
stationarities is especially useful is an experiment with differ- 
ences in the stimulus presentation or feedback mode between 
sessions. For instance if a visual cue is presented in the test 
phase, but is lacking when calibrating the system then we 
may expect increased occipital activity in the test data due to 
additional visual processing. This increase in activity should 
be taken into account when computing the spatial filters as 
otherwise it may lead to non-stationary features. Since this 
increase is relatively stable between subjects, we can learn its 
patterns from other users and use them to extract invariant
features.

1 This assumption is reasonable as the feature space extracted by CSP
usually does not contain more than a few dimensions.

In summary, regularization towards discriminative sub- 
spaces of other users and utilization of knowledge about 
prominent changes are two complementary tasks which have 
different assumptions and scenarios of application. The reg- 
ularization approach has already been successfully applied 
in BCI studies and is especially promising when data is 
scarce and the subject similarity is high. The transfer of 
non-stationary information on the other hand is novel and 
is especially useful when common non-stationarities can be 
expected from the experiment. 

This paper is organized as follows. In the next section we 
present related work and review two state-of-the-art methods 
for between-subject transfer in BCI. In Section III we describe 
the underlying assumptions of our approach and introduce the 
algorithm. In Section IV we present and analyse results from 
toy experiments and experiments on real EEG recordings from 
two different data sets containing prominent non-stationarities 
between training and test session. We conclude in Section V 
with a discussion. 

II. Related Work 

Reliable classification under covariate shift, i.e. in situations 
where the data distribution changes between training and 
testing phase, is a topic of increasing popularity in many 
application domains of machine learning [6], [7]. In particular
it is of interest in the field of Brain-Computer Interfacing as 
the measured brain signals are highly non-stationary [8], [9],
[10]. There are basically two strategies to tackle the problem of
changing signal properties, namely adaptation of the features 
or the classifier and extraction of robust representations that are 
less affected by variations in the underlying brain processes. 
The approaches presented in this work all belong to the second 
category, thus we limit the literature review to that. 

One of the most popular feature extraction methods in BCI 
is Common Spatial Patterns (CSP) [11], [12], [13] as it is
well suited to discriminate between different mental states 
induced by motor imagery. A spatial filter w computed with 
CSP maximizes the variance of band-pass filtered EEG signals 
in one condition while minimizing it in the other condition. 
Since variance of a band-pass filtered signal is equal to band 
power, CSP enhances the differences in band power between 
two conditions. CSP is prone to overfitting and does not ensure
stationarity of the features, thus many different variants
robustifying the original algorithm have been proposed [14], [15].
The idea of an invariant feature space was proposed in [16] and 
was adapted in [15] where the authors introduce a stationary 
version of CSP to trade-off stationarity and discriminativity of 
the extracted features. The stationary CSP method penalizes 
filters that lead to non-stationary features, thus ensures stability 
over time and consequently better classification. Since this 
method is computed on training data and does not incorporate 
data from other subjects, it is not able to capture changes 
occurring in the transition between training and testing stage. 
A different strategy to ensure stationarity of the features was
proposed in [17], [18]. The authors propose to remove the
non-stationary subspace from the data in a preprocessing step
prior to feature computation. However, here too the shift
between sessions is not considered, nor does the method
incorporate data from other subjects.
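
As a minimal sketch of the CSP computation described above (not the authors' implementation), the filters can be obtained by whitening the composite covariance matrix and then diagonalizing the whitened covariance of one class. The function names, the array layout and the default of three filters per class are assumptions made for illustration.

```python
import numpy as np

def csp_filters(cov_a, cov_b, n_per_class=3):
    """Return CSP filters (as columns) and their eigenvalues for two class covariances."""
    # Whiten the composite covariance so that the generalized eigenproblem
    # cov_a w = d (cov_a + cov_b) w turns into an ordinary symmetric one.
    evals, evecs = np.linalg.eigh(cov_a + cov_b)
    whitener = evecs / np.sqrt(evals)  # column j scaled by 1/sqrt(evals[j])
    # Eigenvalues d lie in [0, 1]: near 1 means high variance in class a,
    # near 0 means high variance in class b.
    d, u = np.linalg.eigh(whitener.T @ cov_a @ whitener)
    w = whitener @ u  # filters ordered by ascending d
    picks = np.r_[np.arange(n_per_class), np.arange(len(d) - n_per_class, len(d))]
    return w[:, picks], d[picks]

def log_variance_features(trials, filters):
    """trials: (n_trials, n_channels, n_samples) -> (n_trials, n_filters)."""
    projected = np.einsum("cf,tcs->tfs", filters, trials)
    return np.log(projected.var(axis=2))
```

Taking filters from both ends of the eigenvalue spectrum yields features whose band power is large in one condition and small in the other.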

Several CSP extensions utilizing information from other 
subjects have been proposed in the context of zero-training 
BCI and small-sample setting. For instance, a very recently
proposed method [2] learns a spatial filter for a new subject
based on its own data and that of other users. Another recent
work [4] regularizes the Common Spatial Patterns (CSP) and
Linear Discriminant Analysis (LDA) algorithms based on data
from a subset of automatically selected subjects. A method
that aims at zero training for Brain-Computer Interfacing
by utilizing knowledge from the same subject collected in
previous sessions was proposed in [1], [19], [20]. The authors
of [3] train a classifier that is able to learn from multiple
subjects by multi-task learning. The method proposed in [5] 
uses the similarity between subjects measured by Kullback- 
Leibler divergence as weight for improving the covariance 
estimation by shrinkage. 

In the following we describe two CSP variants that incor- 
porate data from other subjects in more detail. 

The method proposed by Lotte and Guan [4] regularizes the
estimated covariance matrix towards the average covariance 
matrix of other subjects. This kind of regularization may 
largely improve the estimation quality of the high dimensional 
covariance matrix if data is scarce. The estimation for subject 
i* can be written as 

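$$\tilde{\Sigma}_{i^*,c} = (1-\lambda)\,\Sigma_{i^*,c} + \frac{\lambda}{n-1} \sum_{i=1,\, i \neq i^*}^{n} \Sigma_{i,c}, \qquad (1)$$

where $\Sigma_{i^*,c}$ is the covariance matrix of class $c$ for the subject
of interest, $\Sigma_{i,c}$ are the covariance matrices of the other
subjects $i = 1 \dots n$, $i \neq i^*$, and $\lambda \in [0, 1]$ is a regularization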
parameter controlling the amount of information incorporated 
from other users. This method is based on a very restrictive 
assumption, namely the similarity between covariance matrices 
of different subjects. The authors in [4] recognized that this
assumption is often violated due to large inter-subject vari- 
ability, thus they proposed a sequential algorithm for subject 
selection. In the following we will refer to this approach as 
covariance-based CSP (covCSP). 
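
The shrinkage step of covCSP can be sketched in a few lines. This is an illustrative reading of (1) that assumes the regularization target is the plain average over the other subjects' class covariance matrices; the function and argument names are hypothetical.

```python
import numpy as np

def regularized_covariance(cov_target, cov_others, lam):
    """Shrink the target subject's class covariance towards the mean of the others.

    cov_target: (D, D) covariance of the subject of interest for one class.
    cov_others: (n - 1, D, D) covariances of the other subjects for the same class.
    lam: regularization strength in [0, 1].
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    mean_others = np.mean(cov_others, axis=0)  # average over the other subjects
    return (1.0 - lam) * cov_target + lam * mean_others
```

With `lam = 0` the estimate reduces to the subject's own covariance matrix, while `lam = 1` discards it entirely in favour of the other subjects.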

The method proposed by Devlaminck et al. [2] assumes
a similarity between spatial filters extracted from different
subjects. The goal of this CSP variant is to construct a more
global feature space by decomposing the spatial filter $w_i$ for
each subject $i$ into a global part $w_0$ and a subject-specific part $v_i$,

$$w_i = w_0 + v_i, \qquad (2)$$

and applying a single optimization framework to learn both
types of filters

$$\max_{w_0, v_i} \sum_{i=1}^{n} \frac{w_i^\top \Sigma_{i,1} w_i}{w_i^\top (\Sigma_{i,1} + \Sigma_{i,2}) w_i + \lambda_1 \|w_0\|^2 + \lambda_2 \|v_i\|^2}. \qquad (3)$$

The parameters $\lambda_1$ and $\lambda_2$ trade off the global and the
subject-specific part of the filter. For a high value of $\lambda_1$ and a low
value of $\lambda_2$ the vector $w_0$ is forced to zero and a specific
filter is constructed. The opposite case forces the vector $v_i$
to zero and more global filters are computed. Furthermore,
one can also perform regularization by choosing both $\lambda_1$ and
$\lambda_2$ high. The optimization is performed by Newton's method
and conjugate constraints² are added when extracting multiple
spatial filters. Note that here, too, the assumption of similarity
between spatial filters is very restrictive, and a single objective
function makes the optimization problem more difficult as it
cannot be formulated as a generalized eigenvalue problem.
The authors of [2] propose a cluster-based approach to tackle
the problem of inter-subject variability. In the following this
method will be referred to as multi-task CSP (mtCSP).

TABLE I

DESCRIPTION OF OUR ALGORITHM. THE NON-STATIONARY SUBSPACE IS
COMPUTED FROM OTHER SUBJECTS $i$ IN ORDER TO ACHIEVE INVARIANCE
FOR USER $i^*$.

(1) For each subject $i = 1 \dots n$, $i \neq i^*$, compute the
    eigenvectors of $\Sigma_i^{train} - \Sigma_i^{test}$.
(2) For each subject $i$ select the $l$ eigenvectors
    with largest absolute eigenvalues.
(3) Aggregate the vectors into a matrix $P$.
(4) Apply PCA to reduce the dimensionality of the
    non-stationary subspace $S_P = \mathrm{span}(P)$ to $\nu$.
(5) Compute the projection matrix $P^\perp$ to the
    orthogonal complement of $S_P$.
(6) Make $i^*$'s data invariant to the changes by projecting
    out non-stationarities: $\tilde{X} = (P^\perp)^\top P^\perp X$.
(7) Compute spatial filters from $\tilde{X}$ using CSP.
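
To make the role of the two trade-off parameters in the mtCSP objective (3) concrete, the following sketch evaluates that objective for given filters. It assumes the objective is a sum of per-subject Rayleigh quotients with both norm penalties in the denominator, and it only evaluates the function; the Newton optimization used in [2] is not reproduced. All names are hypothetical.

```python
import numpy as np

def mtcsp_objective(w0, v, covs_1, covs_2, lam1, lam2):
    """Evaluate the penalized multi-subject objective for filters w_i = w0 + v_i.

    w0: (D,) global part; v: (n_subjects, D) subject-specific parts;
    covs_1, covs_2: (n_subjects, D, D) class covariances per subject.
    """
    total = 0.0
    for v_i, c1, c2 in zip(v, covs_1, covs_2):
        w_i = w0 + v_i
        numerator = w_i @ c1 @ w_i
        denominator = (w_i @ (c1 + c2) @ w_i
                       + lam1 * (w0 @ w0)      # penalizes the global part
                       + lam2 * (v_i @ v_i))   # penalizes the specific part
        total += numerator / denominator
    return total
```

Increasing `lam1` lowers the objective for any solution with a large global part, which is why a high `lam1` pushes the optimum towards subject-specific filters.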

III. Transferring Non-Stationarities 

In this section we introduce a novel way of using transfer 
learning in Brain-Computer Interfacing. We present a method 
that transfers non-stationary information between subjects, 
thus effectively bridging the gap between training and test data.
Note that we do not claim that our method is the first one to
tackle the problem of non-stationarity in BCI. There are of
course other methods such as stationary CSP [15] or adaptation
methods [21], [22]; however, we are not aware of any
multi-subject method that tackles the non-stationarity problem.

A. Stationary Subspace CSP 

The goal of the stationary subspace CSP (ssCSP) method 
is to remove the subspace that contains the principal non- 
stationary directions common to most subjects prior to CSP 
computation. The algorithm is summarized in Table I.

In the following we briefly describe how to extract invariant
features for subject $i^*$ by utilizing data from other users. In
the first step of the method, prominent directions of change
are extracted from the other subjects $i = 1 \dots n$, $i \neq i^*$. For
that, an eigendecomposition of the difference between the training
and test covariance matrices, $\Sigma_i^{train} - \Sigma_i^{test}$, is computed.
Note that the $l$ eigenvectors $v_i^{(1)}, v_i^{(2)}, \dots, v_i^{(l)}$ with largest
absolute eigenvalues $|d_i^{(1)}|, |d_i^{(2)}|, \dots, |d_i^{(l)}|$ capture most of the
changes occurring between training and test. The parameter $l$
can be a fixed value or chosen adaptively for each subject,
e.g. by setting a threshold on the power spectrum of the
eigendecomposition. Aggregating the eigenvectors obtained
from the different subjects gives a matrix
$P = [v_1^{(1)} \dots v_1^{(l)} \; \dots \; v_n^{(1)} \dots v_n^{(l)}]$ (excluding subject $i^*$)
whose columns span the subspace of common
non-stationarities $S_P = \mathrm{span}(P)$. Let $P^\perp$ be the matrix that
projects data to the orthogonal complement of $S_P$, which is
defined as $S_P^\perp = \{x \in \mathbb{R}^D : \langle x, y \rangle = 0 \text{ for all } y \in S_P\}$. In
order to construct invariant features for subject $i^*$, the common
non-stationary subspace is projected out from its data $X$, i.e.
$\tilde{X} = (P^\perp)^\top P^\perp X$ is computed, and CSP is applied. Note that
the matrix $P^\perp$ was computed solely on data from other subjects.

2 The $i$th spatial filter $w_i$ is conjugate to the spatial filters $w_k$ with
$k = 1 \dots i-1$ with respect to $\Sigma_{i,c}$, i.e. $w_i^\top \Sigma_{i,c} w_k = 0$.
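
The steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: covariance matrices are given per subject as (train, test) pairs, the PCA step is implemented as an uncentered SVD of the aggregated matrix, and all function names are hypothetical.

```python
import numpy as np

def change_directions(cov_train, cov_test, n_dirs):
    """Eigenvectors of (cov_train - cov_test) with the largest absolute eigenvalues."""
    evals, evecs = np.linalg.eigh(cov_train - cov_test)
    order = np.argsort(-np.abs(evals))
    return evecs[:, order[:n_dirs]]

def nonstationary_basis(P, nu):
    """Orthonormal basis of the nu dominant directions of span(P).

    Uncentered PCA: the left singular vectors of P are the eigenvectors of P P^T.
    """
    u, _, _ = np.linalg.svd(P, full_matrices=False)
    return u[:, :nu]

def complement_projector(basis):
    """Matrix whose rows span the orthogonal complement of span(basis)."""
    n_removed = basis.shape[1]
    u, _, _ = np.linalg.svd(basis, full_matrices=True)
    return u[:, n_removed:].T  # shape (D - n_removed, D)

def remove_common_changes(X, other_subject_covs, n_dirs, nu):
    """X: (D, T) data of the subject of interest.

    other_subject_covs: list of (cov_train, cov_test) pairs from the other subjects.
    """
    P = np.hstack([change_directions(tr, te, n_dirs) for tr, te in other_subject_covs])
    P_perp = complement_projector(nonstationary_basis(P, nu))
    return P_perp.T @ P_perp @ X  # data with the common changes projected out
```

The returned data keeps its original dimensionality; the removed directions simply carry zero variance, so CSP can be applied to it afterwards (in practice on the reduced representation `P_perp @ X` to avoid a rank-deficient covariance matrix).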

The dimensionality $\nu$ of the non-stationary subspace $S_P$
can be reduced by applying Principal Component Analysis
(PCA) to the matrix $P$. This step is important as the
dimensionality of $S_P$ grows linearly with the size of $P$, i.e. with
the number of subjects. By applying PCA we extract
the low-dimensional subspace containing the most relevant
information about non-stationarities. Note that PCA must be
applied without mean subtraction as the column vectors of $P$
are directional vectors without a common zero point. Instead of
removing the subspace $S_P$ completely from $i^*$'s data, one could
regularize the CSP filters towards the orthogonal subspace in
a softer manner by adding a penalty matrix $\Delta = PP^\top$ and a
trade-off parameter to the denominator of the CSP objective
function (as done in [15], [16]). From this perspective our
method can be regarded as a variant of the stationary CSP 
algorithm with a penalty matrix that has been computed from 
data of other subjects and has reduced rank. 
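
The softer, penalty-based variant mentioned above can be sketched as a generalized eigenvalue problem in which the penalty matrix built from $P$ is added to the denominator. The sketch assumes that filters for one class are obtained by maximizing that class's variance and that filters for the other class follow by swapping the two covariance arguments; the trade-off parameter `alpha` and the function name are hypothetical.

```python
import numpy as np

def penalized_csp(cov_a, cov_b, P, alpha, n_filters=3):
    """Filters maximizing w' cov_a w / (w' (cov_a + cov_b + alpha * P P') w)."""
    denom = cov_a + cov_b + alpha * (P @ P.T)
    # Whiten the penalized denominator, then solve an ordinary eigenproblem.
    evals, evecs = np.linalg.eigh(denom)
    whitener = evecs / np.sqrt(evals)
    d, u = np.linalg.eigh(whitener.T @ cov_a @ whitener)
    order = np.argsort(d)[::-1][:n_filters]  # largest quotient first
    return whitener @ u[:, order]
```

For `alpha = 0` this reduces to standard CSP, and for very large `alpha` the filters become nearly orthogonal to the columns of `P`, which approaches the hard projection.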

Our approach requires setting two parameters, $l$ and $\nu$.
The first parameter controls the number of non-stationary 
directions extracted per subject. This parameter can have a 
fixed value for all subjects or be subject dependent, e.g. by 
defining a threshold on the amount of changes one wants 
to capture. The second parameter sets the dimensionality of 
the non-stationary subspace that is removed. Note that the 
parameters cannot be determined by cross-validation on the
subject of interest as the goal of our method is to reduce the
shift between training and test data and this does not necessarily
correlate with a performance increase on the training data. One
approach to determine the parameters is to cross-validate the 
classification performance in a leave-one-subject-out manner 
on the other subjects. 
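
A leave-one-subject-out search over the two parameters might look like the sketch below. The callable `error_fn` is a hypothetical placeholder for the full pipeline (build the non-stationary subspace from the donors, apply it to the held-out subject, train on its training session and return its test error); the grid bounds are illustrative defaults.

```python
import itertools
import numpy as np

def select_parameters(subject_ids, error_fn, l_grid=range(1, 11), nu_grid=range(1, 11)):
    """Pick (l, nu) minimizing the mean error over held-out subjects.

    subject_ids: the *other* subjects (the subject of interest is excluded).
    error_fn(held_out, donors, l, nu): user-supplied, returns an error rate.
    """
    best, best_err = None, np.inf
    for l, nu in itertools.product(l_grid, nu_grid):
        errs = []
        for held_out in subject_ids:
            donors = [s for s in subject_ids if s != held_out]
            errs.append(error_fn(held_out, donors, l, nu))
        mean_err = float(np.mean(errs))
        if mean_err < best_err:
            best, best_err = (l, nu), mean_err
    return best, best_err
```

Because the selection uses only the other subjects, no test data of the subject of interest is touched when choosing the parameters.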



B. General Considerations 

There are two types of information that can be transferred 
between subjects, namely discriminative and non-stationary 
information. Note that the two transfer types have different
application scenarios: discriminative information is important in
small-sample settings as it may improve the estimation quality
of the spatial filters or classifier, whereas non-stationary
information is valuable when common experiment-related
changes are present in the data. Figure 1 illustrates the
application domains of the multi-subject methods used in this
work.

Fig. 1. Overview of the two application domains of transfer learning in
BCI. If all subjects have very different discriminative and non-stationary 
subspaces then transfer learning is not possible, thus CSP is the method 
of choice. Multi-subject methods like covCSP and mtCSP are applicable if 
common discriminative subspaces exist. The ssCSP method is designed to 
remove principal changes from data, thus it assumes common non-stationary
subspaces. If both the discriminative and non-stationary subspaces are similar 
between subjects, then a subsequent application of ssCSP and mtCSP (or 
covCSP) will give best results. 

If there are no common discriminative and non-stationary 
subspaces in the data, then transfer learning is not applicable, 
thus CSP is the method of choice. If on the other hand the 
most discriminative or non-stationary directions are similar 
between subjects, then the multi-subject methods described in 
this paper may perform much better than CSP. Finally, if both 
types of information can be transferred between users, then a 
combination of the multi-subject methods gives best results. 

In order to choose the best method one needs to assess the
similarity between the subjects or their discriminative and non- 
stationary subspaces. This is not an easy task and is often 
not possible e.g. the directions of change cannot be estimated 
when test data is not available. Furthermore it is common to 
perform subject selection or clustering prior to multi-subject 
learning in order to ensure a high level of similarity between 
users. However, this also requires that the subject similarity 
can be reliably estimated and that a large number of other 
subjects is available. 

All three transfer learning approaches presented in this 
paper have regularization parameters controlling the amount 
of information transferred between subjects. A bad choice 
of these parameters may negatively affect performance, espe- 
cially if subject similarity is low. Please note that the amount 
of information transferred in the ssCSP case is limited by 
the maximal dimensionality of the non-stationary subspace 
that is removed from the data³, whereas in the case of
covCSP and mtCSP it is not limited, i.e. the classification 
may be completely based on data from other subjects. This 
is an important advantage of our multi-subject method as 
this limitation avoids a significant performance decrease when 
subject similarity is low. 

3 Since we are only interested in removing the most common changes, the
maximal size of the non-stationary subspace should not exceed a fraction of
the data dimensionality.

An example where transferring non-stationarities between
subjects is more promising than learning the discriminative
part is illustrated in Fig. 2. This figure shows four artificial
subjects with varying discriminative subspaces, but common 
directions of change. In Section IV (Fig. 4) we will see that
the real EEG recordings used in this paper have exactly these 
properties. Note that most multi-subject methods for BCI 
assume similarity between discriminative subspaces, thus may 
provide suboptimal results in such a setting. We discuss this 
point in the toy example in the next section. One can also see from
the figure that both the discriminative and non-stationary sub- 
spaces are relatively small compared to the dimensionality of 
the data. This is a reasonable assumption as few CSP directions 
usually suffice to capture the relevant information and although 
a larger part of the data may show non-stationary behaviour 
only few changes can be explained by differences between 
sessions. Note that we are not assuming that discriminative and
non-stationary subspaces are disjoint; on the contrary, we explicitly
aim to extract a feature space that represents the real BCI-related
activity and ignores discriminativity that is induced
by a particular experimental setting, e.g. involuntary eye
movements may produce discriminative EEG patterns when
using visual stimuli. Since this activity is not induced by
motor imagery but is an artefact of the experimental setting, its
patterns become meaningless and can harm performance when
switching to a different mode of stimulus presentation. Therefore,
removing discriminative activity that is non-stationary
makes perfect sense when aiming for robust classification.

Fig. 2. An example where transferring non-stationarities between subjects
is more promising than learning the discriminative part. The discriminative 
subspaces vary between subjects, whereas the non-stationary subspaces stay 
the same. Both subspaces are relatively small compared to the dimensionality 
of the data. 



IV. Experimental Evaluation 
A. Toy Experiment 

In this subsection we study the stability of the three multi- 
subject methods under increasing dissimilarity between sub- 
jects. In other words we evaluate the impact on classification 
performance when moving from transferring relevant infor- 
mation to transferring meaningless information. The data set 
consists of artificially generated training and test recordings 
of five subjects. In order to separately study the effect of 
dissimilarity of the discriminative subspace and the non- 
stationary subspace, we generate the data as sum of two 
independent mixtures. In more detail, data $x$ is generated as the
sum of a stationary noise-signal term and a non-stationary
noise term,

$$x(t) = A \begin{bmatrix} s^{dis}(t) \\ s^{ndis}(t) \end{bmatrix} + B \begin{bmatrix} s^{stat}(t) \\ s^{nstat}(t) \end{bmatrix}.$$

Note that we call the first mixture the "noise-signal term"
as it contains contributions from sources that are relevant for 
a particular BCI task (signal) as well as contributions from 
non-relevant sources (noise). The second mixture is called 
"noise term" as its sources are not important for classification. 
Thus the toy data is generated by a mixture model with non- 
stationary noise. The matrices A and B are random rotation 
matrices mixing the (non-)discriminative and (non-)stationary 
sources and the sources are normally-distributed (with zero 
mean), mutually independent and independent in time. In 
order to approximate the properties of real data we restrict 
the discriminative and non-stationary subspaces to be low- 
dimensional. 

The following parameters are used for the experiments. The 
discriminative subspace is spanned by 6 sources $s^{dis}$ with
variance 0.8 in one condition and 0.1 in the other one and
the non-discriminative subspace consists of 74 sources $s^{ndis}$
with fixed variance of 0.1. The 75 stationary sources $s^{stat}$
have variance 1 in both the training and test data set, whereas
the variance of the 5 non-stationary sources $s^{nstat}$ is 1 in the
training data set and 3 in the test data set. For each artificial 
subject we generate 100 trials per condition, each consisting 
of 100 data points, for both the training and the test set. As in 
the real experiments described later in this section we extract 
three CSP filters per class and use log-variance features and 
an LDA classifier. We determine the parameters for the
multi-subject methods by cross-validating classification performance
in a leave-one-subject-out manner on the other users. The 
following experiments were performed on this toy data set 
using 100 repetitions. 
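
A minimal sketch of how such toy data could be generated is given below. It uses the variances and dimensions stated above, but two points are assumptions: random rotations are drawn via a QR decomposition here (the text constructs them from a matrix exponential), and all six discriminative sources are assumed to switch variance together between the two conditions. Function names are hypothetical.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthogonal matrix with determinant +1 (QR of a Gaussian matrix)."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # flip one axis to obtain a proper rotation
    return q

def generate_trials(A, B, condition, is_test, rng, n_trials=100, n_samples=100):
    """Return trials of shape (n_trials, 80, n_samples) for one condition (0 or 1)."""
    var_dis = 0.8 if condition == 0 else 0.1   # assumption: all 6 sources switch together
    var_nstat = 3.0 if is_test else 1.0        # non-stationary sources change at test time
    std_signal = np.sqrt(np.r_[np.full(6, var_dis), np.full(74, 0.1)])
    std_noise = np.sqrt(np.r_[np.ones(75), np.full(5, var_nstat)])
    s1 = rng.standard_normal((n_trials, 80, n_samples)) * std_signal[:, None]
    s2 = rng.standard_normal((n_trials, 80, n_samples)) * std_noise[:, None]
    return A @ s1 + B @ s2           # noise-signal term plus noise term
```

Sharing `B` across artificial subjects while varying `A` (or the reverse) gives the two dissimilarity settings studied in the experiments.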

In the first experiment we fix matrix B for all subjects, but
increase the distance between the mixing matrix A of subject
1 and the mixing matrices of the other subjects by adding an
increasing amount of randomness while making sure that it
still remains a rotation matrix.⁴ In other words, we simulate
the case of increasing dissimilarity between the discriminative
subspaces of subject 1 and the other artificial users. The results
for the three multi-subject methods are summarized in the top
row of Fig. 3. Each boxplot shows the distribution of
classification error rates of subject 1 for increasing dissimilarity
values $\eta$. Furthermore, the median CSP error rate is plotted
as a green curve. We see from the figure that methods that
transfer discriminative information between subjects, namely
covCSP and mtCSP, significantly decrease error rates when
the dissimilarity between the mixing matrices A of subject
1 and the others is low. However, as the information that
is transferred becomes more and more random, their error
rates increase almost to chance level. The stationary subspace
CSP method is not affected by increased dissimilarity of the
mixing matrices A as it does not transfer discriminative
information. It is able to improve classification performance
as the non-stationary subspace remains the same for all
subjects (matrix B is constant).

4 Matrix A is constructed as the matrix exponential of a random antisymmetric
matrix M, i.e. $A = e^{M}$. By adding a random matrix S to M we obtain
$M_2 = M + \eta S$. The new rotation matrix $A_2$ can be computed as
$A_2 = e^{\frac{1}{2}(M_2 - M_2^\top)}$. The weight $\eta$ controls the distance between A and $A_2$.
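
The construction of increasingly dissimilar rotation matrices can be sketched as follows. To stay dependency-free, the matrix exponential of an antisymmetric matrix is computed here through the eigendecomposition of the Hermitian matrix $iM$, which is an implementation choice of this sketch and not something stated in the text; the function names are hypothetical.

```python
import numpy as np

def random_antisymmetric(dim, rng):
    """Random real matrix M with M^T = -M."""
    g = rng.standard_normal((dim, dim))
    return 0.5 * (g - g.T)

def rotation_from_antisymmetric(M):
    """exp(M) for antisymmetric M, computed via the Hermitian matrix H = 1j * M.

    With H = V diag(h) V^H we have M = -1j * H and exp(M) = V diag(exp(-1j * h)) V^H.
    """
    h, V = np.linalg.eigh(1j * M)
    return np.real((V * np.exp(-1j * h)) @ V.conj().T)

def perturbed_rotation(M, S, eta):
    """Rotation built from M2 = M + eta * S after antisymmetrizing M2."""
    M2 = M + eta * S
    return rotation_from_antisymmetric(0.5 * (M2 - M2.T))
```

For `eta = 0` the original rotation is recovered, and the antisymmetrization guarantees that the perturbed matrix is still a rotation for any `eta`.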

In the second experiment we simulate the opposite case, 
namely we fix A and increase the dissimilarity of B between 
subject 1 and the others. The middle row of Fig. 3 shows the
results for this case. We can observe a stable improvement
of the methods covCSP and mtCSP because the discriminative
subspaces are the same for all subjects irrespective of B.
The figure shows an improved performance (decrease in error 
rates) for the ssCSP method when the dissimilarity between 
the non-stationary subspaces is low and a performance drop 
when it is high. However, the important point here is that in 
contrast to the discriminativity transfer in the last experiment 
the performance loss is minimal, actually the performance 
goes back to CSP level. This increased robustness can be 
explained with a lower risk of losing important information 
when regularizing the solution away from a small subspace. 
Although the transferred non-stationary information becomes 
more and more meaningless when distance between the mixing 
matrices B increases, classification accuracy does not decrease 
on average since only few directions are removed from data. 
Note that this asymmetric behaviour of covCSP, mtCSP and
ssCSP depends strongly on the size of the discriminative and
non-stationary subspaces, the selection of regularization
parameters and, of course, whether subject (pre)selection is used.
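
The asymmetry between regularizing away from and towards a small subspace can be illustrated numerically. The sketch below is not part of the experiments reported here; it assumes the transferred subspace is completely uninformative (drawn at random) and measures how much of a fixed unit direction survives when that subspace is projected out. In expectation the retained squared norm is $1 - k/D$ for a random $k$-dimensional subspace in $D$ dimensions.

```python
import numpy as np

def retained_fraction(dim, n_removed, n_draws, rng):
    """Mean squared norm of a fixed unit vector after removing a random subspace."""
    u = np.zeros(dim)
    u[0] = 1.0
    kept = []
    for _ in range(n_draws):
        # Orthonormal basis of a random n_removed-dimensional subspace.
        q, _ = np.linalg.qr(rng.standard_normal((dim, n_removed)))
        residual = u - q @ (q.T @ u)   # project the random subspace out
        kept.append(residual @ residual)
    return float(np.mean(kept))
```

With 80 dimensions, removing 5 random directions leaves most of a relevant direction intact, whereas removing 74 (i.e. keeping only a random 6-dimensional subspace) leaves very little, which mirrors the robustness gap observed between ssCSP and the methods that transfer discriminative information.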

In the final experiment we let both matrices A and B be 
either different or the same between subject 1 and the other 
users (bottom row of Fig. 3). In the first case multi-subject
methods have no advantage over CSP as there is no meaningful 
information to be transferred. On the contrary, the methods 
transferring discriminative information may even lose perfor- 
mance as the solution is regularized towards a non-informative 
subspace. In the other case when both subspaces stay constant 
over subjects we observe a significant performance gain of
all multi-subject methods. Since the non-stationarity problem 
is more severe than the estimation problem, we obtain best 
results for both the ssCSP method and the combination of 
ssCSP and mtCSP (denoted as ss+mtCSP). 

Fig. 3. Results of the three multi-subject methods on toy data. The upper row shows the case when discriminative subspaces become more and more dissimilar
but the non-stationarities stay the same for all subjects. One can see that covCSP and mtCSP improve classification performance when subjects are similar,
but when the difference between them becomes larger the information transferred becomes more and more meaningless, thus error rates increase almost
to chance level. The ssCSP method improves classification accuracy as it removes non-stationarities and is not affected by differences in the discriminative
subspaces. The middle row shows results for the opposite case, namely constant discriminative subspaces but different non-stationary directions. The ssCSP
method improves classification accuracy when the information transferred is meaningful, but does not lead to a significant increase in error rates when this
is not the case. This effect is due to the asymmetry of regularizing towards and away from a small subspace. The bottom row shows the performance of all
methods in the extreme case when both subspaces are either different or common between subjects.

B. Data Set

Two different data sets are used for the real-data experiment.
The first one consists of two calibration (i.e. without feedback)
recordings from five healthy participants. The volunteers
performed motor imagery of two limbs, specifically "left hand"
and "foot". The cues indicating the stimulus were presented
either visually (with an arrow appearing in the center of
the screen) or auditorily (a voice announcing the task to be
performed), resulting in two different data sets for each user. In
this experiment, the training data set was the calibration with
visual stimuli and the test data set the calibration with auditory
stimuli. A time segment from 750 ms to 3500 ms after
the cue instructing the subject to perform motor imagery is
extracted from each trial and the signal is band-pass filtered
in 8-30 Hz using a 5th-order Butterworth filter. Both the
training and test set contain 132 trials, equally distributed
between the classes. The data was recorded at 1000 Hz using a
multichannel system with 85 electrodes densely covering the
motor cortex. After filtering, it was down-sampled to 100 Hz.
The features are extracted as log-band power on CSP filtered
channels (three filters per class) and Linear Discriminant
Analysis (LDA) is used for classification.
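
The band-pass filtering and epoching described above could be sketched with SciPy as follows. Several details are assumptions of this sketch: the filter is applied causally in a single pass, the data is assumed to be available at the sampling rate `fs` passed in, and cue onsets are given as sample indices. The function name is hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandpass_and_epoch(raw, cue_samples, fs=100.0, band=(8.0, 30.0),
                       window_s=(0.75, 3.5), order=5):
    """Band-pass filter continuous data and cut one segment per cue.

    raw: (n_channels, n_samples) continuous recording.
    cue_samples: iterable of cue onsets in samples.
    Returns an array of shape (n_trials, n_channels, n_window_samples).
    """
    # Butterworth band-pass in second-order sections for numerical stability.
    sos = butter(order, band, btype="bandpass", fs=fs, output="sos")
    filtered = sosfilt(sos, raw, axis=1)
    start, stop = (int(round(t * fs)) for t in window_s)
    return np.stack([filtered[:, c + start:c + stop] for c in cue_samples])
```

The resulting epochs can be passed to a CSP routine, with log-variance of the spatially filtered signals serving as band-power features.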

The second set of recordings is the data set IVa [23]
from BCI Competition III [24] consisting of EEG recordings
from five healthy subjects performing right hand and foot MI
without feedback. Two types of visual cues, letters appearing
behind a fixation cross and a randomly moving object, shown
for 3.5 s were used to indicate the target class. The presentation
of target cues was interrupted by periods of random length,
1.75 to 2.25 s, in which the subject could relax. The EEG
signal was recorded from 118 Ag/AgCl electrodes, band-pass
filtered between 0.05 and 200 Hz and downsampled to 100 Hz;
280 trials are available for each subject. We manually
selected 68 electrodes densely covering the motor cortex and 
divided the data into a training and testing set based on the 
type of cue. Note that this division does not coincide with 
the one used for the competition, but in our experiments the 
subjects Bl and B3 have 210 training trials (3 runs) and 70 
test trials (1 run) and the other users have an equal number of 
140 trials (2 runs) in each set. We extracted a time segment 
located from 500ms to 2500ms after the cue instructing the 
subject to perform motor imagery and band-pass filtered the 
signal in 8-30 Hz using a 5-th order Butterworth filter. 

In addition to standard CSP we compute spatial 
filters with covCSP using the training covariance 
matrices of other subjects as regularization target 
and a wide range of trade-off parameters $\lambda = 0, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1$. 
We also apply mtCSP using training data from other subjects 
and different trade-off parameters for $\lambda_1$ and $\lambda_2$, namely 
$10^{-4}, 10^{-3}, \ldots, 10^{3}, 10^{4}$. The optimization is initialized with 
the spatial filters obtained by CSP. Finally, the ssCSP 
approach is used with $l = 1, \ldots, 10$ and $\nu = 1, \ldots, 10$. We 
apply cross-validation in a leave-one-subject-out manner 
on the other subjects and use classification performance 
as selection criterion. In order to allow better comparison 
between methods and reduce complexity we do not use 
subject selection or subject clustering. Note that all analysis 
and interpretation is performed on the first data set. 
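
The parameter search described above can be sketched as follows. The grids mirror the values listed in the text; the selection routine is only one plausible reading of the leave-one-subject-out procedure, and `select_parameters` and the `evaluate` callback (which would train on the given subjects and return classification performance on the held-out one) are hypothetical names, not part of the paper.

```python
from itertools import product
from statistics import mean

# Trade-off parameters for covCSP: 0, 1e-5 ... 1e-1, 0.2 ... 1
COVCSP_LAMBDAS = [0.0] + [10.0 ** e for e in range(-5, 0)] + [i / 10 for i in range(2, 11)]
# All (lambda_1, lambda_2) pairs for mtCSP: 1e-4 ... 1e4 each
MTCSP_GRID = list(product([10.0 ** e for e in range(-4, 5)], repeat=2))
# All (l, nu) pairs for ssCSP: 1 ... 10 each
SSCSP_GRID = list(product(range(1, 11), repeat=2))

def select_parameters(other_subjects, grid, evaluate):
    """Pick the grid entry with the best mean leave-one-subject-out score.

    evaluate(train_subjects, held_out_subject, params) is a user-supplied
    (hypothetical) function returning a classification performance value.
    """
    def cv_score(params):
        return mean(
            evaluate([s for s in other_subjects if s != held_out], held_out, params)
            for held_out in other_subjects
        )
    return max(grid, key=cv_score)
```

Because the selection uses only the other subjects, no test data of the subject of interest enters the parameter choice.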

C. Initial Analysis 

In an initial analysis we study the similarity between 
users in order to evaluate whether multi-subject 
CSP methods are at all applicable. For this we first 
measure the distance between the covariance matrices 
of subjects i and j by the symmetric Kullback-Leibler 
Divergence (see footnote 5) 

$$\tilde{D}_{KL} = D_{KL}\big(\mathcal{N}(0, \Sigma_i) \,\|\, \mathcal{N}(0, \Sigma_j)\big) + D_{KL}\big(\mathcal{N}(0, \Sigma_j) \,\|\, \mathcal{N}(0, \Sigma_i)\big).$$

Table II summarizes the 
results for each subject: it shows the average distance 
between the training/test covariance matrices of different 
subjects and the distance between the training and test covariance 
matrix of the same user. One can see that variations between 
subjects are up to two orders of magnitude larger than differences between 
training and test sessions. This indicates that transferring 
discriminative information between users may be highly 
unreliable. For subjects A4 and A5 we can also see a clear 
relation between classification error rate (see Table III) and 
divergence between training and test data. 
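
For zero-mean Gaussians the divergence depends only on the two covariance matrices, so the distance used above can be computed directly. The following is a minimal sketch; the function names are illustrative and the covariance matrices are assumed to be symmetric positive definite.

```python
import numpy as np

def kl_zero_mean(cov0, cov1):
    """KL divergence D(N(0, cov0) || N(0, cov1)) between zero-mean Gaussians."""
    k = cov0.shape[0]
    # tr(cov1^{-1} cov0), computed without forming the inverse explicitly
    trace_term = np.trace(np.linalg.solve(cov1, cov0))
    # log-determinants via slogdet for numerical stability
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (trace_term - (logdet0 - logdet1) - k)

def symmetric_kl(cov_i, cov_j):
    """Symmetric KL divergence: sum of the divergences in both directions."""
    return kl_zero_mean(cov_i, cov_j) + kl_zero_mean(cov_j, cov_i)
```

In the symmetric version the log-determinant terms cancel, so the value is determined by the two trace terms and the dimensionality alone.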

In Fig. 4 we analyse the similarity of subspaces extracted 
from different users. We measure similarity as the mean of the squared 
cosines of the principal angles $\theta_k$ between the subspaces (see footnote 6). 
This corresponds to the amount of energy preserved when 
projecting data from one subspace to the other, thus higher 
values indicate closer subspaces. Considering all principal 
angles gives a clearer picture of the relation between two 
subspaces than restricting the analysis to the largest 
principal angle, as the latter tends to become 90° very 
fast. We extract two types of subspaces, namely discriminative 
and non-stationary ones. The discriminative subspace is 
constructed from the CSP spatial filters (eigenvectors of (4)) 
with largest eigenvalues after pooling the filters from both 
classes. The non-stationary subspace is constructed from the 
prominent non-stationary directions (eigenvectors with largest 
absolute eigenvalues) between training and test. From the plot 
we clearly see that the discriminative subspaces (red line) are 
not very similar between different users; their similarity is close 
to random (black dashed line), whereas the similarity between 
dominant non-stationary subspaces (blue line) is significant. 
This is an important insight and the main motivation of our 
method. 
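
The similarity measure used in this analysis can be computed from orthonormal bases of the two subspaces, since the singular values of their cross-product are the cosines of the principal angles. A minimal sketch, in which the function name `subspace_similarity` is an illustrative choice:

```python
import numpy as np

def subspace_similarity(basis_a, basis_b):
    """Mean of squared cosines of the principal angles between two subspaces.

    basis_a, basis_b: arrays of shape (n_channels, dim) whose columns span
    the subspaces (they need not be orthonormal).
    """
    # Orthonormalize both bases
    qa, _ = np.linalg.qr(basis_a)
    qb, _ = np.linalg.qr(basis_b)
    # Singular values of qa^T qb are the cosines of the principal angles
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))
```

The value is 1 for identical subspaces and 0 for mutually orthogonal ones.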



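5 The Kullback-Leibler Divergence between Gaussians is defined as 
$D_{KL}(\mathcal{N}_0 \| \mathcal{N}_1) = \frac{1}{2} \left( \mathrm{tr}\left(\Sigma_1^{-1} \Sigma_0\right) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1} (\mu_1 - \mu_0) - \ln\left(\frac{\det \Sigma_0}{\det \Sigma_1}\right) - k \right)$. 

6 Principal angles are defined recursively as 
$\cos(\theta_k) = \max_{u \in \mathcal{F}} \max_{v \in \mathcal{G}} u^\top v = u_k^\top v_k$ 
subject to $\|u\| = \|v\| = 1$, $u^\top u_i = 0$, $v^\top v_i = 0$, $i = 1, \ldots, k-1$. 
Note that the canonical correlations are equal to the cosines of the principal angles. 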
Fig. 4. Similarity between subspaces of different subjects measured as 
canonical correlation, or equivalently the mean of squared cosines of the 
principal angles. Each square and circle corresponds to one comparison 
between two users, whereas the solid lines represent the mean similarities. 
We see that in contrast to the dominant non-stationary directions (blue line) 
the discriminative subspaces (red line) are quite different between subjects. 



D. Performance Comparison 

Table III summarizes the performance results for both data 
sets. We clearly see that performance can be improved by 
incorporating data from other users; however, not all subjects 
profit equally. As mentioned before, ssCSP has a different focus 
than covCSP and mtCSP, namely it tackles the non-stationarity 
problem and not the estimation problem. Therefore it is not 
surprising that some users like A4, B1 and B3 significantly 
improve when mtCSP is applied and others like A1, A4 and 
B5 profit from the application of ssCSP. Note that the latter 
subjects have a large shift between training and test (see Table 
II). We would also like to point out that, in contrast to covCSP 
and mtCSP, there is no significant decrease in performance 
when applying the ssCSP method. This observation is in line 
with the results from the toy experiment. The bottom row 
of Table III shows the results of the combination of ssCSP 
and mtCSP with the regularization parameters obtained when 
applying both methods individually. In other words, we first 
project out the non-stationary subspace obtained by ssCSP 
and then compute the spatial filters with mtCSP using the 
regularization parameters obtained when applying it to the 
original data. We see that this method gives the best 
performance results as it combines both transfer learning approaches. 
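
The projection step of the combined method can be pictured with a small sketch. It rests on two assumptions that are not stated in this section: that non-stationary directions are taken as the eigenvectors with largest absolute eigenvalue of the difference between two covariance matrices, and that projecting out means applying the orthogonal projector onto the complement of those directions. In ssCSP the directions are estimated from other subjects; here a single covariance pair is used for illustration, and both function names are hypothetical.

```python
import numpy as np

def nonstationary_directions(cov_train, cov_test, n_dirs):
    """Eigenvectors of (cov_test - cov_train) with largest absolute eigenvalues.

    Assumption: the covariance difference is used as the measure of change.
    """
    evals, evecs = np.linalg.eigh(cov_test - cov_train)
    order = np.argsort(-np.abs(evals))
    return evecs[:, order[:n_dirs]]

def project_out(cov, directions):
    """Remove the subspace spanned by `directions` from a covariance matrix."""
    q, _ = np.linalg.qr(directions)          # orthonormal basis of the subspace
    proj = np.eye(cov.shape[0]) - q @ q.T    # projector onto its complement
    return proj @ cov @ proj.T
```

Spatial filters would then be computed on the projected covariance matrices, so that activity along the removed directions no longer influences them.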

E. Interpretation 

In the following we analyse the non-stationary activity 
patterns and investigate the reasons for the performance gain 
in more detail for the first subject, A1. 

Each row of Fig. 5 visualizes the five most non-stationary 
directions of a subject. One can see that the patterns are 
highly similar between users. This similarity is also reflected in 
Fig. 4. The non-stationarity patterns clearly show a relation to 
the change in the experimental conditions, i.e. the transition 
from a visual mode of stimulus presentation to an auditory 
one, as they focus mainly on occipital and temporal activity. 
From neuroscience it is well known that occipital areas are 
responsible for visual processing and temporal regions are 
associated with auditory tasks. In other words, the shift between 
training and test session is minimized by projecting out activity 
that is related to the presentation mode of the stimulus. 

TABLE II 

This table shows the average distance, measured by symmetric Kullback-Leibler Divergence, between the covariance matrices 
of different subjects (first and second row) and between the training and test covariance matrices for the same subject. We 
clearly see that the differences between subjects are up to two orders of magnitude larger than the differences between training and test. 

Description                                                                            A1     A2     A3     A4     A5 
Average $\tilde{D}_{KL}$ to the training covariance matrices of other subjects         490    799    650    853    657 
Average $\tilde{D}_{KL}$ to the test covariance matrices of other subjects             995   1803   1799   1947   1377 
$\tilde{D}_{KL}$ between training and test covariance matrix for particular subject     62     27     57    110     15 

TABLE III 

Comparison of classification accuracies (in %) for different multi-subject CSP variants. All subjects profit from the information transfer except 
users A5 and B2. The best overall performance can be achieved by the combination of ssCSP and mtCSP. 

            Audio-Visual Data Set           BCI Competition III             Overall 
Subject     A1    A2    A3    A4    A5      B1    B2    B3    B4    B5      Mean  Median  Std 
CSP         79.5  80.0  65.8  59.2  94.2    66.1  96.4  58.2  88.8  81.0    76.9  79.8    14.0 
covCSP      79.5  79.2  60.8  60.8  93.3    66.1  96.4  58.2  58.5  71      72.4  68.6    14.2 
mtCSP       72.7  70.0  48.3  75.0  92.5    72.3  94.6  68.4  65.6  82.1    74.2  72.5    13.4 
ssCSP       87.1  80.8  67.5  65.8  93.3    67.0  94.6  58.2  89.3  85.7    78.9  83.3    13.1 
ss+mtCSP    87.9  80.8  66.7  69.2  93.3    71.4  94.6  66.3  88.4  84.9    80.4  82.9    11.1 



In Fig. 6 we see the change between the training and test 
features of subject A1 for CSP and ssCSP. We selected this 
user because he shows a significant increase in performance. We 
plot the two feature dimensions that correspond to the most 
discriminative filters in both conditions. We see that in the case 
of CSP the feature distribution obtained from training data is 
different from that computed on the test set. On the other hand, 
when applying ssCSP there is only little difference between 
both distributions. 

Fig. 5. Visualization of most non-stationary directions for each subject (in the 
rows). We clearly see that some of the patterns, e.g. the first and third of subject 
A3, indicate a change in activity over occipital and temporal areas. These 
brain regions are mainly responsible for visual and auditory processing. Thus 
the principal non-stationary directions capture the change in the experimental 
conditions from a visual mode of stimulus presentation to an auditory one. 

Fig. 6. Visualization of the two most discriminative dimensions for subject 
A1. A significant change in the feature distribution between training (blue 
circles) and test (red crosses) can be observed for the standard CSP method, 
whereas when applying ssCSP this change becomes almost negligible. 



F. Learning from Noise? 

An interesting question is whether the prominent changes 
occur in the discriminative or in the non-discriminative part 
of the signal. In order to study this question we compute the 
similarity between the non-stationary and the discriminative 
subspace for each subject. As before, the similarity is measured 
as the mean squared cosine of the principal angles. The overall 
similarity between both subspaces is low, around 0.04. 
In order to assess the significance of this value we estimate 
an empirical distribution of the similarity scores by generating 
10000 random subspaces. In our simulation only around seven 
subspace pairs (out of 10000) have a similarity score smaller 
than 0.04, thus the similarity of the discriminative and 
non-stationary subspaces is significantly lower than random. 
This indicates that most of the shift is present in the 
non-discriminative part of the data. 
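
The significance test described above can be sketched as a small Monte Carlo simulation. The subspace dimensions and the way random subspaces are drawn (column spaces of Gaussian random matrices) are assumptions made for illustration; the paper does not specify them in this section, and the function names are hypothetical.

```python
import numpy as np

def mean_sq_cosine(basis_a, basis_b):
    """Mean of squared cosines of the principal angles between two subspaces."""
    qa, _ = np.linalg.qr(basis_a)
    qb, _ = np.linalg.qr(basis_b)
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))

def random_similarity_scores(n_channels, dim_a, dim_b, n_draws=10000, seed=0):
    """Similarity scores for pairs of randomly drawn subspaces."""
    rng = np.random.default_rng(seed)
    scores = np.empty(n_draws)
    for i in range(n_draws):
        # Column spaces of Gaussian matrices are uniformly distributed subspaces
        a = rng.standard_normal((n_channels, dim_a))
        b = rng.standard_normal((n_channels, dim_b))
        scores[i] = mean_sq_cosine(a, b)
    return scores

def fraction_below(observed, scores):
    """Fraction of random pairs whose similarity is smaller than the observed one."""
    return float(np.mean(scores < observed))
```

An observed similarity is considered significantly lower than random when the returned fraction is very small, as with the roughly seven out of 10000 pairs reported above.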

In order to assess how well the principal non-stationarities 
can be estimated from the non-discriminative part of the 
data, we project out the (discriminative) CSP directions from 
the data of each subject prior to computation of the 
non-stationary subspace. When applying this approach to both 
data sets we obtain an average performance of 78.1, i.e. the 
performance loss compared to the original ssCSP method 
(78.9) is minimal and not significant. This is a surprising 
result as it indicates that the non-discriminative noise signal 
subspace can help to construct invariant features. This subspace 
is generally removed (by applying CSP) prior to classification 
and regarded as non-task-related noise. Thus we need to 
revisit the statement that noise never helps, as it can be used 
to improve classification accuracy and reduce the need for 
adaptation in a BCI scenario. 

V. Discussion 

Non-stationarities in BCI experiments are rather common 
and they are notoriously hard to model. In this work we 
showed that information about dominant changes can be 
transferred between subjects and is mainly contained in the 
non-discriminative (noise) part of the data. Thus, somewhat 
paradoxically, the noise part can be the key to improving 
classification accuracy, as it allows invariant features to be defined. 
We showed quantitatively that prominent non-stationarities 
resulting from changes in the experimental conditions are much 
more stably estimated between subjects than their respective 
discriminative (information carrying) subspaces. Note that the 
non-stationarity information transferred between subjects 
appears physiologically interpretable and meaningful. Moreover, 
removing non-stationarities from data is seen to be more robust 
to perturbations than learning discriminative subspaces, thus 
subject selection or weighting is not required. In the 
future we will investigate theoretical limits and applications of our 
concept to transfer learning and covariate shift models. Finally, 
we intend to evaluate our approach in an online BCI setting 
and investigate ways to transfer information obtained from 
different imaging modalities [25], [26]. 